Exploratory Data Analysis on White Wine Data Set by Jimin Yu

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The white wine data set has information on 4898 wines that were graded by wine experts. The data set contains information on a given wine’s acidity, sugar concentration, pH, alcohol concentration, etc. The 12th variable is quality of each wine graded by experts from 0 (bad) to 10 (excellent).

Univariate Plots Section

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The first column is just a column index so I took it out. Let’s take a look at the distributions of the variables.
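The loading code is not shown, so the following is only a minimal sketch of how an index column can be dropped. The toy data frame and the column name `X` (the name `read.csv` gives an unnamed first column) are assumptions, and the values are illustrative.

```r
# Toy stand-in for the raw file; values are illustrative only.
raw <- data.frame(
  X             = 1:3,              # row index carried over from the CSV
  fixed.acidity = c(7.0, 6.3, 8.1),
  quality       = c(6L, 6L, 6L)
)

# Keep every column except the index.
w_toy <- raw[, names(raw) != "X", drop = FALSE]
str(w_toy)
```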

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The minimum and maximum wine ratings are 3 and 9, respectively. The most common rating is 6, and very few wines received ratings of 3 or 9. Let’s take a look at other variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

I log-transformed the variable to better visualize its distribution. Most wines have a volatile acidity (amount of acetic acid) of around 0.3. The white wine document says that too much acetic acid can lead to an unpleasant, vinegar taste. Will I observe an inverse relationship between wine rating and volatile acidity later on?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Seems to peak around 0.3 with a few outliers to the right. It is said that citric acid can add freshness and flavor to wines. I’m interested to see the relationship between wine rating and this variable as well.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

I log-transformed the variable and cut off outliers for better visualization. The distribution appears bimodal, with peaks around 1.7 and 8.5. There is one extreme outlier (65.8 g/dm^3 of residual sugar). I’m definitely interested to see the general relationship between wine quality and sugar concentration.
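A log transform combined with outlier trimming might look like the sketch below. The 99th-percentile cutoff, the bin count, and the sample values are assumptions for illustration; the report does not state which cutoff was used.

```r
# Illustrative residual sugar values (g/dm^3), including one extreme value.
sugar <- c(0.6, 1.2, 1.7, 1.8, 5.2, 8.5, 9.9, 14.0, 20.7, 65.8)

# Drop everything above the 99th percentile (an assumed cutoff).
cutoff  <- quantile(sugar, 0.99)
trimmed <- sugar[sugar <= cutoff]

# Histogram of the log10-transformed values.
# plot = FALSE returns the bin counts without drawing anything.
h <- hist(log10(trimmed), breaks = 5, plot = FALSE)
print(h$counts)
```

Setting `plot = TRUE` (the default) draws the histogram instead of only returning the counts.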

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Log-transformed for better visualization. Chlorides seem to peak around 0.044.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Square-root-transformed for better visualization. Total sulfur dioxide peaks around 130 and has an extreme outlier (440.0). Sulfur dioxide (SO2) prevents microbial growth and the oxidation of wine. The white wine document says that a free SO2 concentration of over 50 can be detected in the nose and taste of wine. I’m interested in seeing how this variable also affects wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

All wines fall between about 2.7 and 3.8 on the pH scale. Since the range is fairly narrow, I don’t think pH will influence wine rating much, but we’ll see.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Seems to peak around 0.5. Sulphate is a wine additive that can contribute to SO2 levels. I expect this to correlate quite strongly with SO2 level. Would they also have similar effects on wine rating, if any?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol is distributed across a fairly wide range (8 to 14.2 %). How would alcohol affect wine quality?

Univariate Analysis

What is the structure of your dataset?

The data has 4898 observations of 12 variables. Eleven of these variables represent properties of a wine such as acidity, sugar concentration, pH, and alcohol concentration. The 12th variable is the quality of each wine, graded by experts from 0 (bad) to 10 (excellent). All of the variables are continuous except for wine quality. Most of them are unimodal and a few of them have outliers.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are which variable(s) influence wine quality/rating significantly and how the changes in these variables influence the quality of wine.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed and square-root-transformed several variables to reduce the skew in their distributions and see the patterns in the data more clearly.
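To illustrate why such a transform helps, the sketch below compares the sample skewness of a right-skewed vector before and after a square-root transform. The `skewness` helper is a hypothetical function defined here (base R does not ship one), and the values are illustrative, not the wine data.

```r
# Moment-based sample skewness; a hypothetical helper, not part of base R.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustrative right-skewed values with a long upper tail.
x <- c(9, 60, 108, 120, 134, 150, 167, 210, 300, 440)

# Compare skewness on the raw and square-root scales.
sk <- c(raw = skewness(x), sqrt = skewness(sqrt(x)))
print(round(sk, 2))
```

A value closer to zero on the transformed scale indicates a more symmetric distribution; `log10(x)` can be swapped in for `sqrt(x)` to compare the two transforms.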

Bivariate Plots Section

# Changing wine quality from numeric to factor.
w$quality = as.factor(w$quality)

Here I change the variable “quality” to factor so I can create proper boxplots.
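As a small sketch of why the factor conversion matters for boxplots: with a factor on the right-hand side of the formula, `boxplot` draws one box per rating. The toy data frame is an assumption; keeping a numeric copy in `numQuality` mirrors the column name used later in this report, but how that column was actually created is not shown.

```r
# Toy data; values are illustrative only.
w_toy <- data.frame(
  quality = c(5L, 6L, 6L, 7L, 5L, 8L),
  alcohol = c(9.5, 10.1, 10.4, 11.4, 9.0, 12.5)
)

# Keep a numeric copy for correlations, then convert to a factor.
w_toy$numQuality <- w_toy$quality
w_toy$quality    <- as.factor(w_toy$quality)

# Count of wines per rating.
print(table(w_toy$quality))

# One box per factor level; plot = FALSE returns the statistics only.
bp <- boxplot(alcohol ~ quality, data = w_toy, plot = FALSE)
```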

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

It seems to me that wines that received ratings of 5 and higher have a fairly narrow range of volatile acidity compared to those that received ratings of 3 or 4. This makes sense because volatile acidity (amount of acetic acid) is said to be unpleasant at high levels. What’s important to note here and throughout the analysis is that only five wines received a rating of 9, so it may be difficult to tell what the true distribution for that rating looks like.

I cut off the outliers to visualize the distribution better. Right off the bat I can see there isn’t a simple, linear relationship between sugar and wine quality. For wines with ratings of 5 or higher, though, wine quality seems to increase as sugar content decreases. Since the overall trend isn’t linear, there may be other variables influencing wine quality here.

## [1] -0.116647

As expected, there is a weak negative correlation (using Spearman) of -0.117 between wine quality and sugar concentration for wines that received ratings of 5 or higher. Here and throughout the analysis, Spearman correlation is used because it is less sensitive to outliers and does not assume a normal distribution.
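The sketch below shows a Spearman correlation computed on a subset of wines rated 5 or higher, and why it is robust to outliers: it depends only on ranks, so changing the size of an extreme value does not change the result. The vectors are illustrative, not the wine data.

```r
# Illustrative values only.
quality <- c(3, 4, 5, 5, 6, 6, 7, 7, 8, 9)
sugar   <- c(12.0, 10.0, 9.0, 8.5, 7.0, 7.5, 5.0, 4.0, 3.0, 65.8)

# Restrict to wines rated 5 or higher, as in the text.
keep <- quality >= 5

rho_spearman <- cor(quality[keep], sugar[keep], method = "spearman")
rho_pearson  <- cor(quality[keep], sugar[keep], method = "pearson")

# Shrinking the outlier leaves the ranks, and so Spearman's rho, unchanged.
sugar_alt        <- replace(sugar, sugar == 65.8, 20.7)
rho_spearman_alt <- cor(quality[keep], sugar_alt[keep], method = "spearman")

print(c(spearman = rho_spearman, spearman_alt = rho_spearman_alt,
        pearson = rho_pearson))
```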

## [1] -0.3144885

The boxplot seems to indicate that higher quality wines tend to have a lower salt (chlorides) concentration. As expected from the boxplot, there is a moderate negative correlation of -0.314 between salt concentration and wine quality.

## [1] -0.1966803

The boxplot and the correlation coefficient show that a weak negative relationship exists between wine quality and total sulfur dioxide. This makes sense because too high a free sulfur dioxide concentration (above 50) is said to be detectable by taste and nose and is unpleasant.

## [1] 0.03331897

I don’t see a noteworthy trend here.

## [1] 0.4403692

I see the strongest positive correlation yet seen between alcohol and wine quality. The correlation is even more apparent when you see the boxplot for wines that have ratings of 5 or greater. Is it just alcohol that is influencing the wine quality or is alcohol correlated with other features that also influence wine quality? To answer this, let’s see a correlation matrix of all features.

From the correlation matrix, you can tell that alcohol is negatively correlated with residual.sugar, chlorides, and total.sulfur.dioxide. This means that wines with higher alcohol concentration tend to have less sugar, salt, and total sulfur dioxide. Since all of these variables are negatively correlated with wine quality (“numQuality”), wines that have low sugar, salt, and total sulfur dioxide are more likely to be high quality wines. Therefore, alcohol concentration may show the strongest positive correlation with wine quality simply because wines with high alcohol concentration tend to have low sugar, salt, and total sulfur dioxide.
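A correlation matrix of this kind can be sketched with `cor` applied to a data frame of numeric columns. The toy values are illustrative, and using Spearman here is an assumption carried over from the earlier correlations; the method used for the matrix itself is not stated.

```r
# Toy data; values are illustrative only.
toy <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 12.8, 10.4),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 1.5, 1.2, 2.0, 7.0),
  chlorides      = c(0.045, 0.049, 0.050, 0.058, 0.044, 0.030, 0.028, 0.046),
  numQuality     = c(5, 5, 6, 6, 6, 7, 8, 6)
)

# Pairwise Spearman correlations between all columns, rounded for reading.
cm <- round(cor(toy, method = "spearman"), 2)
print(cm)
```

A matrix like this shows pairwise associations only; it cannot by itself separate the effect of alcohol from that of the variables it is correlated with.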

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Volatile acidity, sugar, salt, and total sulfur dioxide are negatively correlated with wine quality. Alcohol, on the other hand, is positively correlated with wine quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol is negatively correlated with all of the other variables mentioned above (volatile acidity, sugar, salt, and total sulfur dioxide).

What was the strongest relationship you found?

In terms of correlation, alcohol had the strongest relationship with wine quality.

Multivariate Plots Section

In the last section, we found variables that are correlated with wine quality. Now I’m curious to see how wine ratings are distributed among combinations of these variables (total sulfur dioxide, chlorides, volatile acidity, residual sugar, alcohol). Before proceeding, let’s take a look at the correlation matrix one more time to make sure we haven’t missed anything important.

Besides the five variables we took note of, there is one more variable that shows a weak, negative correlation with wine quality: fixed acidity. I’ll include this variable for exploration in this section.

w$newQualityLevel = cut(w$numQuality, breaks = c(0, 4, 7, 10))

For the following section, I divided the wine quality ratings (scored on a 0 to 10 scale) into 3 intervals, (0, 4], (4, 7], and (7, 10], to better visualize patterns in the data. The variable newQualityLevel stores this information. I’ll refer to wines in the range (7, 10] as “high quality” wines, wines in the range (4, 7] as “medium quality” wines, and wines in the range (0, 4] as “low quality” wines.
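To make the interval boundaries concrete, here is a small sketch that applies the same `cut` call to the observed ratings 3 through 9. The `ratings` and `levels_3` names are hypothetical.

```r
# One wine per observed rating, for illustration.
ratings <- 3:9

# Same breaks as above; intervals are open on the left, closed on the right.
levels_3 <- cut(ratings, breaks = c(0, 4, 7, 10))

# Ratings 3-4 fall in (0,4], 5-7 in (4,7], and 8-9 in (7,10].
print(table(levels_3))
```

Passing `labels = c("low", "medium", "high")` to `cut` would replace the interval notation with readable names.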

Here I’m looking at the distributions of wines of varying qualities across total sulfur dioxide and chlorides concentration. High quality wines are mostly distributed from around 80 to 180 mg/dm^3 of total sulfur dioxide. Medium quality wines are distributed a bit more widely from around 70 to 250 mg/dm^3. Low quality wines are distributed from 50 to 200 mg/dm^3. There doesn’t seem to be notable separation of wine qualities across chlorides concentrations.

There doesn’t seem to be notable separation of wine qualities across both the residual sugar concentration and volatile acidity.

It appears that the majority of high quality wines have an alcohol concentration of 10 to 13 %, while lower quality wines range more evenly from 8.5 to 13 %. There doesn’t seem to be notable separation of wine qualities across fixed acidity.

Out of the six variables we looked at, only alcohol and total sulfur dioxide seem to have visually distinct distributions of wine ratings. Let’s look at both of these variables in a single plot.

As expected, differences in the distributions of wine ratings are visible across the two variables.

Seeing how total sulfur dioxide influences wine distributions, I’m curious about free sulfur dioxide’s influence on wine distributions. The white wine documentation says that free sulfur dioxide (SO2) prevents microbial growth and the oxidation of wine. It also says that SO2 concentration of over 50 ppm becomes evident in the nose and the taste. Let’s see how free SO2 concentration influences the distributions of wine ratings.

There doesn’t seem to be notable separation of wine qualities. I wonder if the ratio of free SO2 to total SO2 concentration may tell a better story. Let’s find out.

The pattern is clearer! High quality wines are mostly distributed from around 0.15 to 0.4. Medium quality wines are distributed a bit more widely from around 0.1 to 0.4. Low quality wines seem to be mostly distributed from 0.04 to 0.3.
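The ratio itself is a simple derived column. The sketch below uses the first three rows shown in the structure output at the top of this report; the column name `so2.ratio` is a hypothetical choice, since the name actually used is not shown.

```r
# First three rows of the two SO2 columns from the str() output.
toy <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Share of total SO2 that is free; column name is hypothetical.
toy$so2.ratio <- toy$free.sulfur.dioxide / toy$total.sulfur.dioxide
print(round(toy$so2.ratio, 3))
```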

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There were three features that strengthened each other: alcohol, total sulfur dioxide, and the ratio of free SO2 to total SO2. The distribution of the highest quality wines separated from the rest of the wine groups in that most high quality wines had around 10 to 13 % alcohol, while many medium and low quality wines had well below 10 %. The distribution of high quality wines was also distinct from that of lower quality wines in terms of total sulfur dioxide concentration and the ratio of free SO2 to total SO2 concentration, although not as distinctly as for alcohol.

Final Plots and Summary

Plot One

Description One

The median and the 1st and 3rd quartiles of alcohol content rise steadily as we move from a quality rating of 5 to 9. This indicates that higher quality wines are more likely to have a higher alcohol concentration.

Plot Two

Description Two

Similar to the first plot, the plot above shows that high quality wines (wines of ratings 8 and 9) tend to have higher alcohol concentration than lower quality wines. Specifically, the plot shows that high quality wines tend to have alcohol concentration of 10.5 to 13.5 %. Medium quality wines (wines of ratings from 5 to 7) have around 8.5 to 13 % alcohol and low quality wines (wines of ratings 3 and 4) have around 8.5 to 12 % alcohol. Fixed Acidity, which does not have a big impact on wine quality distributions, was chosen as y-axis to make wine quality distributions across alcohol concentration stand out.

Plot Three

Description Three

The plot above shows that high quality wines (wines of ratings 8 and 9) are mostly distributed from 0.15 to 0.4 SO2 ratio (free sulfur dioxide / total sulfur dioxide). Medium quality wines (wines of ratings from 5 to 7) are distributed more widely from 0.07 to 0.42 SO2 ratio. Low quality wines (wines of ratings 3 and 4) are mostly distributed from around 0.03 to 0.27 SO2 ratio. Fixed Acidity, which does not have a big impact on wine quality distributions, was chosen as y-axis to make wine quality distributions across the SO2 ratio stand out.

Reflection

At first, I struggled with what type of plot to choose for data exploration. Because wine quality is a discrete variable, my usual go-to scatterplot was unusable. When I tried boxplots, however, it was very easy to recognize patterns in the data. Another difficulty was that few wines received ratings of 9 or 3, so it was hard to draw conclusions about how these wines differ from other wines. This is why I decided to merge the ratings into three groups. Luckily, once I merged them, there were recognizable patterns among the groups of wines of different ratings.

The most surprising finding was that high quality wines tend to have a higher alcohol concentration (which I find distasteful) and less of savory ingredients like sugar and salt.

Although I have tried many features to discover underlying patterns in the data, I must admit that the analysis is not comprehensive. One thing to try in the future is fitting a multiple regression to the data. Because a multiple regression’s coefficients carry information about how one variable affects the variable of interest (in this case wine rating) when all other variables are held constant, the regression can offer new insight. However, to know whether the coefficients are trustworthy, one would need to check that the model fits the data well enough. Doing this, however, may be outside the scope of exploratory data analysis.
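As a rough sketch of what that future step could look like, the code below fits a multiple regression with `lm` on simulated data. Everything here is an assumption for illustration: the data are generated with a fixed seed, the coefficient values are built into the simulation, and nothing in the output is a finding about the wine data.

```r
# Simulated data; the relationship below is invented for illustration.
set.seed(1)
n   <- 50
toy <- data.frame(
  alcohol   = runif(n, 8, 14),
  chlorides = runif(n, 0.01, 0.1)
)
toy$numQuality <- 2 + 0.4 * toy$alcohol - 5 * toy$chlorides + rnorm(n, sd = 0.5)

# Each coefficient estimates the effect of one predictor
# with the other predictor held constant.
fit <- lm(numQuality ~ alcohol + chlorides, data = toy)
print(coef(fit))

# R-squared is one basic check of how well the model fits.
print(summary(fit)$r.squared)
```

On the real data, the formula would list the wine properties of interest, and residual plots would be worth checking because the rating is a discrete score.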